In [1]:
import pandas as pd
In [2]:
data_csv = pd.read_csv("titanic.csv")
In [4]:
data_csv.head()
Out[4]:
CSV stands for Comma-Separated Values, because the values in .csv files are separated by commas. Similarly, the values in many .txt files are separated by tabs ("\t"); such a file is often called a tab-separated file. To read .txt files in pandas we again use the same read_csv() function, but this time we pass another argument besides the name of the file: the separator (which should be a tab for a tab-separated .txt file).
In [6]:
data_txt = pd.read_csv("imagine_lyrics.txt", sep="\t")
In [7]:
data_txt.head()
Out[7]:
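To see the effect of the separator argument without needing a file on disk, here is a minimal sketch. The text content and the name sample_txt are made up for illustration; only read_csv() and its sep argument come from the notebook above.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated content standing in for a .txt file on disk.
raw_text = "name\tage\nAnna\t29\nDavid\t35\n"

# sep="\t" tells read_csv() to split every line on tab characters.
sample_txt = pd.read_csv(StringIO(raw_text), sep="\t")
print(sample_txt)
```

If the separator does not match the file, pandas will usually put each whole line into a single column, which is a quick way to notice the mistake.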
In [9]:
data_html = pd.read_html("https://careercenter.am/")
As you can see, we receive an error here. The problem is that the read_html() function reads only HTML tables, and no table could be found on the careercenter webpage. If you check the source of their website, you will see that it has no content of its own: the content is generated through another file called ccidxann.php. This means we should copy the link to that file and scrape it instead.
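The same kind of failure can be reproduced offline. The sketch below passes a made-up HTML snippet with no table element to read_html(). It assumes an HTML parser such as lxml is installed; without one, pandas raises an ImportError instead, so both cases are caught here.

```python
from io import StringIO

import pandas as pd

# Hypothetical page source that contains no <table> element.
html_without_table = "<html><body><p>No tables here.</p></body></html>"

try:
    tables = pd.read_html(StringIO(html_without_table))
    outcome = "found {} table(s)".format(len(tables))
except ValueError as err:
    # Raised when the parser finds no tables in the document.
    outcome = "ValueError: {}".format(err)
except ImportError as err:
    # Raised when a required HTML parser is not installed.
    outcome = "ImportError: {}".format(err)

print(outcome)
```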
In [11]:
data_html = pd.read_html("https://careercenter.am/ccidxann.php")
In [12]:
data_html.head()
This time the head() function cannot be used, because read_html() returns a list of dataframes rather than a single dataframe. So let's just print it.
In [13]:
print(data_html)
We may check the length of the list to see how many elements it has. Each element is one separate table.
In [14]:
len(data_html)
Out[14]:
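Instead of opening the tables one by one, we could loop over the list and look at the shape of each table. The sketch below uses a small hand-made list (the name tables and its contents are hypothetical) as a stand-in for the scraped result.

```python
import pandas as pd

# Hypothetical stand-in for the list of dataframes returned by read_html().
tables = [
    pd.DataFrame({0: [None, None], 1: ["Job A", "Job B"]}),
    pd.DataFrame({0: ["x"], 1: ["y"]}),
]

# Collect (rows, columns) for every table in the list.
table_shapes = [table.shape for table in tables]

for position, shape in enumerate(table_shapes):
    print(position, shape)
```

This gives a quick overview of which element is likely to hold the data we need before indexing into the list.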
In [15]:
data_html[0]
Out[15]:
In [16]:
data_html[1]
Out[16]:
In [17]:
data_html[2]
Out[17]:
In [18]:
data_html[3]
Out[18]:
Let's take only the job postings table, which has 2 columns like all the others. The first column has only NaN values, so we will choose only the second one and save it as our data for analysis.
In [19]:
data = data_html[0][1]
Now we have a single column of the table (a pandas Series), which can already be used together with head() and other functions.
In [20]:
data.head()
Out[20]:
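Selecting one column by its label returns a pandas Series. If we would rather keep a dataframe, one possible alternative is to drop the columns that contain only NaN values. The sketch below uses a small made-up table; the names table, second_column and non_empty are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical table: column 0 holds only NaN, column 1 holds the postings.
table = pd.DataFrame({0: [np.nan, np.nan], 1: ["Job A", "Job B"]})

# Selecting a single column by its label gives a Series.
second_column = table[1]

# Dropping all-NaN columns keeps the result as a dataframe.
non_empty = table.dropna(axis=1, how="all")

print(type(second_column).__name__)
print(type(non_empty).__name__)
```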
Pandas also has functions for reading Excel, Stata, SAS, JSON, SQL and other files. You may check the official documentation for details.
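As one example of these other readers, here is a minimal sketch with read_json(). The JSON text is made up for illustration and is read from memory rather than from a file.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON records standing in for a .json file on disk.
raw_json = (
    '[{"title": "Analyst", "city": "Yerevan"},'
    ' {"title": "Developer", "city": "Gyumri"}]'
)

# Each JSON record becomes one row, each key becomes one column.
data_json = pd.read_json(StringIO(raw_json))
print(data_json)
```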
In [21]:
data.to_csv("careercenter_data.csv")
We may now go to our working folder to check the CSV file.
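Another way to check the result is to read the saved file back into pandas. The sketch below does this with a small made-up Series and a temporary folder, so it does not touch the real file; the names postings, path and restored are illustrative. Passing index=False is an optional choice that leaves the row numbers out of the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the scraped job postings.
postings = pd.Series(["Job A", "Job B"], name="posting")

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    # header=True writes the Series name as the column header.
    postings.to_csv(path, index=False, header=True)
    # Read the file back to confirm what was saved.
    restored = pd.read_csv(path)

print(restored)
```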